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ABSTRACT 

We present a modification of Simon's Algorithm 1,2 that in some cases is able to fit experimentally obtained data 
to appropriately chosen trial functions with high probability. Modulo constants pertaining to the reliability and 
probability of success of the algorithm, the algorithm runs using only 0(polylog(\Y\)) queries to the quantum 
database and 0(polylog( \X\, \ Y\)) elementary quantum gates where \X\ is the size of the experimental data set 
and |y| is the size of the parameter space. We discuss heuristics for good performance, analyze the performance 
of the algorithm in the case of linear regression, both one-dimensional and multidimensional, and outline the 
algorithm's limitations. 

Keywords: quantum computation, artificial intelligence, modeling, regression 

1. INTRODUCTION 

Assume that two experiments are run to determine the parameters governing a physical phenomenon. If very 
few data points are obtained, then it is difficult to determine the best parameters via a cursory examination of 
a data plot. 



In this case, one might abandon the plot and use a computer to quickly calculate the best parameter range 
using standard techniques. On the other hand, if numerous experimental data points are obtained and a plot 
of the data is saturated with such points, one would have far less trouble detecting a good fit to the data; of 
course, given such a large quantity of data, the same computer would most likely take far longer for its analysis 
(assuming that no sampling techniques are used to pare down the data). 
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There appears to be an relationship between the case with which computers and people analyze data. 

The point of this mental exercise is to note that classical computers do not have the human ability to "eyeball" 
a graph, to capture the shape of a graph by evaluating the data it contains "all at once" . One essential difference 
between classical (and even classical probabilistic) computation and quantum computation is the ability of the 
quantum computer to evaluate a function at more than one point at a time. For example, given a function / 
on a single bit, a quantum computer has the ability to evaluate / at the point |a;) = -^\0) + -j= an x value 

that has no meaning classically*, in order to realize both /(0) and /(l) in the resulting state. The simplest and 
earliest use of this interesting property in tandem with quantum interference is Deutsch's algorithm 4 which is 
able to extract global properties of / in fewer evaluations than was possible classically. In this paper, we will 

'Throughout this paper, we will use the notation of the vector formulation of quantum computation as outlined in 
Chuang and Nielsen. 3 



show how classical artificial intelligence tasks, in particular, fitting a mathematical model to given experimental 
data, may in some cases be sped up via the use of quantum computation and quantum data representation. 

Throughout the paper, we use / consistently to represent the experimental data. We perform a physical 
experiment and obtain classical data that we then store in a quantum database indexed by the independent 
variables of the experiment. Thus, the action of / on a basis vector |a;) is to return the value of the experimental 
data when the independent variable was x. More generally, if a superposition of basis vectors is entered into 
the quantum database then the database returns the appropriate superposition of outputs. In other words, we 
define the quantum database by the following action^: 

/: ( c x \x)) ® \b) i-» Cx\x)®\b®f(x)) 
xex xex 

(This database formulation is roughly the same computational model as the one used in Grover's quantum 
database search. 5,6 ) We will use the variable y to represent the parameter vector and g y to represent the pa- 
rameterized function to which we wish to fit the data. We will assume that g y is a classical function implemented 
with quantum gates. (For example, if f(x) — 3y/x, then a good guess for g y is g y {x) = y^fx. Note that x and y 
need not always be single numbers. See Section 3.) 

In Section 2, we review some mathematical preliminaries and give a rigorous grounding to the notion of 
what it means for two functions to have a "similar shape". In this section, we also discuss previous research 
and give several references on the topic of multidimensional functional optimization. The precise mathematical 
implementation of the various functions, the main algorithm, and a discussion of the query/time-complexity are 
presented in Section 3. A rigorous mathematical analysis of the performance of the algorithm in the case of 
one-dimensional and multivariable linear regression is given in Section 4. An example of the algorithm running 
on a nonlinear sample input as well as certain tedious mathematical justifications arc in the Appendix, Section 
5. 

2. MATHEMATICAL PRELIMINARIES 
2.1. The Shape Measure 

Assume that we are given a function / and another function g y that is supposed to approximate /. Both are 
defined on the domain X. We ask the following question: For a given value of y, how good an approximation 
to / is g y 1 Usually, when such a question is asked, the "accuracy" of the approximation is quantified in a single 
number. The most common measure is the sum of the squared distances: 

L 2 (f,g y ) = J2\f(x)-g y (x)\ 2 

xex 

Analysis of this L 2 measure under the assumption that g y (x) = yx leads to the methods of linear regression, 7 
probably the most standard data fitting technique in use today. 

Unfortunately, analysis of the L 2 measure is not always easy. In the case where the function / is given by 
experimentally obtained data, the expression given by L 2 (f,g y ) has \X\ terms in it. If \X\ is large, y is a vector 
with many elements, or the parameterized function g y is not well-behaved (as opposed to linear regression), 
minimizing L 2 (f,g y ) can be quite a difficult and resource-consuming problem. Global function optimization has 
been an active field of research for years. In particular, we refer the reader to the brief review of continuous 
global optimization by Pinter. 8 Numerous Internet sites offer publicly available global optimization software that 

^For completeness, we present an alternate physical scenario in which our algorithm would apply equally well. We are 
given a black box that takes as input a vector of input and scratch qubits and implements the function /. The black 
box performs some unknown quantum operations on the qubits and outputs the results of the computation of / and 
possibly other output scratch qubits that remain in safe storage. We assume that there exists a basis (and, without loss 
of generality, one should assume it is the standard basis) such that the action of the experiment function / is as follows. 
For any basis vector \x), f : \x) C8> \b) i— » \x) (g) \b® f(x)) where \f(x)) is also a basis vector. Our goal is to model the black 
box experiment / by some classical function. 



take as input a user-defined black-box function. These programs attempt to minimize the number of function 
evaluations necessary to obtain the global maximum/minimum. We cite as examples those listed in the software 
review by Mongeau et al. 9 : adaptive simulated annealing, 10 clustering algorithms, 11 ' 12 genetic algorithms, 13 
multi-level random search, 1 Monte-Carlo based random search, 15 and various combinations. 16 

In order to get a relatively small value for L 2 , it is necessary to choose y such that g y is vertically close to 
the function /. When data is collected experimentally, most of the time scientists are interested in the shape of 
the graph rather than the vertical intercept because the shape has more physical significance. For this reason, 
we define a measure to compete with the L 2 standard, the shape measure* Q, as follows. 

Definition 1. 

Q(f,g y ,M) = \J2 ryj 1 2 

xex > > 

The parameter M should be thought of as the "sensitivity" of the measure Q: the lower the value of M, the 
more sensitive the measure is to small perturbations in the shapes of the graphs of / and g y . The measure Q is 
somewhat counterintuitive in the sense that the more similar the shapes of / and g are, the larger the value of 
Q(f,g y ,M) and vice versa. 

We can think of this function intuitively as follows. Imagine a compass on a unit circle initially pointing 
East. For each value of x, move the compass pointer around the unit circle clockwise if f(x) > g y {x) and 
counterclockwise if f(x) < g y (x). The greater the difference between f(x) and g y {x), the further around the 
compass you move. If we can arrange it so that the most we can move around the compass is, for example, under 
| radians, then if f(x) is much greater than g y (x), the compass will be pointing close to North. Similarly, if 
f(x) is much less than g y (x), the compass will be pointing South. Given the compass direction for each value 
of x, walk one unit for every direction that the compass points. Squaring your final distance from the origin is 
what Q(f,g y ,M) measures. 

Note that Q(f,g y ,M) gives the same distance no matter which direction we arbitrarily call "East". This 
special property corresponds to the invariance under vertical translation that we need. 

Lemma 2.1. For any real numbers ki and k 2 and any sensitivity M , Q(f, g y , M) — Q(f + k\, g y + k 2 , M). 

We now justify the use of the term "sensitivity" for the parameter M. Intuitively, the following lemma 
states that as the sensitivity M increases, the two functions are differentiated with certainty. As the sensitivity 
decreases, they tend to look approximately the same. 

Definition 2. For any real number a, let 



frac(a) = 




- [a\ ifa>0 

— \a\ if a < 



(Intuitively, frac{a) is just the fractional part of the number a.) 

Lemma 2.2. For any functions f ^ g y , assume that as M — > + 7 frac{ ^~^ v ) becomes uniformly distributed in 
(-§,§). Then 

lim Q(f,g y ,M) = l 

M— >oo 

and 

Proof. The first limit is clear from elementary calculus even without the assumption. We sketch the estimation 
for the second limit. 

"'"The function Q that we define in this section is not a metric in the strict mathematical sense. It is, however, intuitively 
a good measure of separation of functions relative to a given sensitivity M. We will show in the Appendix, Section 5.1, 
that with high probability, there exists a value of M for which there is a simple modification of Q that leads to a true 
metric with high probability and we bound it between reasonable modifications of L\ and L2. In the rest of the paper, 
we continue to use Q as our measure of interest even though it is not technically a metric. 



Firstly, note that as M — > + , by the assumption frac( ^ v ) is uniformly distributed in (— |, |). Recalling 
the formula for M), it is clear that for small enough M, each step eT^/W-SsC*)) is uniformly distributed 

around the unit circle. By elementary probability theory, the expected value of the squared distance for a random 
walk in two dimensions is \X\. Thus, taking into account the \X\ in the denominator, we get the estimate p^y 
for Q(f, g y , M) for small enough M. □ 

2.2. The Parameter Spectrum 

When attempting to search for parameters that fit within a given mathematical model for a scientific phenomenon, 
rather than confine the final value of the parameterization to the single "best" value (as is usually done with 
linear regression and other methods of model fitting), we propose a different approach. Consider the spectrum 
of all possible parameter values. Obviously, some parameter values will better fit the experimental data than 
others. If the mathematical model is good enough, there should be a small range of parameter values that are, 
in some sense, much better than all the others. Identifying this small range, the range of parameter values that 
are "good enough" , given a predetermined error tolerance, is the goal of our parameter search. 

Assume we are given the usual approximation function g y (x) where x is the vector of independent variables 
and y is the parameter vector. Let Y be the set of all possible values for the vector y. For a given parameter 
range Y' C Y, we define the parameter ratio r(Y', Y) as follows. 

Definition 3. 

= Z ye y,Q{f,9v,M) 

Note that MY' C Y, < r(Y',Y) < 1. Recall that the higher the value of Q, the better the shape fit between 
/ and g y . Thus if the shape fit for the parameter values in Y' are better than the other values in Y, this ratio 
r(Y' , Y) should be significantly higher than other possible ranges. 

In the algorithm to follow in Section 3.3, we will reduce the parameter space based on the values of the 
parameter ratios. We assume that there is a predetermined threshold associated with these parameter ratios in 
the following sense. If the parameter ratio of the parameter range Y' C Y is greater than, for example, 65%, we 
might say that we are certain enough that the optimal parameter range is somewhere in Y' . If we can determine 
a Y' with a parameter ratio 65%, all further searches of the parameter space can be restricted to the space Y' . 
Obviously, the number 65% is arbitrary and left up to the individual scientist running the experiment. 

3. THE ALGORITHM 

We will continue using the functions / and g y as defined previously. We assume that / and g y both evaluate 
to nonnegative integers for every value of the vectors x and y. (This restriction may be removed via simple 
transformations of the functions / and g y , but we will assume it here for simplicity.) We assume that we can 
evaluate the functions / and g y out to any given fixed precision. 

Consider the following simple two-dimensional example: We have a quantum oracle that evaluates the function 
f(x\,X2) = 5x! + 16x 2 + noise where the noise term refers to a small random variable contributed by the 
environment or other factors. We have deduced that / evaluates a function that looks roughly like a plane. 
Thus, our approximation function will be gr yi )J/2 ) [x\, x 2 ) = y\X\ + y 2 x 2 - Our aim, of course, is to find the best 
range for the parameters y\ and y 2 . Let us assume that we know that the parameters can only take values less 
than 32. We will therefore dedicate 5 bits for each and start our search space out with (j/i, y 2 ) € [0, 31] x [0, 31]. 
Note that we could have, given more analysis of the data, restricted the variable y\ to only 3 bits without losing 
any pertinent information. However, if we suspected that the parameter y\ included a fractional part (i.e. if 
we suspected that f(xi, X2) — 5.25xi + 16x2 + noise), we could keep the extra bits and use them for additional 
precision by modifying the approximation function as follows: gr yitV2 \(xi, X2) — ^x\ + y 2 x 2 . Negative numbers 
and other mathematical possibilities can be handled in a similar way. 

In order to make the discussion of Section 1 more precise, we assume that we have a black box which computes 
both the functions / and g y to N digits of binary precision in the following sense. On basis vectors, our evaluator 
for / computes (where ® is binary addition mod 2 N ) 

\xi,x 2 ) ® \b) 1 ► \xi,x 2 ) ® \b® f(xi,x 2 )) 



where each |-) represents a bit vector of the appropriate size. (\b) will have N binary digits, as specified.) We 
then extend this to the entire vector space by linearity. This is a well-known model of function evaluation and is 
the basis of most popular quantum algorithms including Shor's factoring algorithm 17 ' 18 and quantum database 
search. 5 ' 6 On basis vectors, our evaluator for g computes (where Q is binary subtraction mod 2 N ) 

\xi, £2,2/1,2/2) ® \b) >-> \x!,x 2 ,yi,y2) ® \be g( yuy2 )(x 1 ,x 2 )) 
and we again extend this by linearity. 

3.1. Heuristics for Good Performance 

Though the algorithm presented here is provably fast and accurate in some cases (see Section 4), we are not 
optimistic enough to claim that this is so in every case. It is very difficult to find rigorous general restrictions 
that will guarantee satisfactory performance. However, this section will present some common-sense heuristics 
in the form of assumptions. Sections 3.2 through 3.4 present the algorithm itself. 

1. We should assume that, for all practical purposes, By' G Y such that 

f(x) = g y '(x) + noise 

We can think of noise as a uniform random variable taking values in [— e, e] for some small e > or as a 
standard normal random variable with relatively small variance. (If the mathematical model is off by a 
quasi-insignificant term, it may be safely incorporated into the noise term.) 

2. We should assume that the noise term is small relative to the difference in model parameter. Vy 7^ y' e Y 
(where y' is defined as in assumption 1), 

Var(noise) <C Var(f — g y ) 

If the noise from the environment drowns out the mathematical model, it will be very difficult to come up 
with good parameters because any choice of parameters will likely be a bad one§ . 

3. We should assume that the mathematical model g y we ascribe to the quantum oracle / is, in a sense, 
"continuous" in the parameter y; that a small change in the y parameter should not produce too large a 
change in the quantum distance measure Q. On the other hand, we also need to assume that Q differentiates 
between good parameter choices and bad ones. Thus, we should assume that there exist constants a > 
and b < 1 such that Vg y 3M such that a < E(Q(g y , g.,M)) < b. 

We will use these heuristics as assumptions below in Section 4 when we analyze the performance of the algorithm 
on linear regression. 

3.2. Choosing the Sensitivity 

This entire subsection should be thought of as a function within a greater procedure. It determines, if possible, 
the best value of the sensitivity for Q for use in further calculations. 

Below, we use the notation X to refer to the set of all possible values for the independent variable \x). We 
initialize the value of N to be |~log 2 max xe x,yeY {/(%), ffy( x )}l > the maximum number of digits that the quantum 
evaluator needs to evaluate either / or g y . Throughout this section, we let the function FFT be the quantum 

N 1-xibk 

fast Fourier Transform 3 whose action on basis vectors is FFT : \b) 1— > J2k=o B J 2 N 

-\k). 

^Classically, before one can determine how bad the noise is, the parameters need to be chosen first. Without knowledge 
of what the model is, one can't determine how much of the data is noise. It is a quantum feature of the algorithm that if 
the environmental noise drowns out the experimental data even before we know what is and isn't noise, the algorithm will 
not be able to make a good choice for Y' . So, if there happens to be no "good" parameter choice, the algorithm simply 
won't make a choice. 



1. We first create the superposition 11 

-2x.f(x) 

xex V\ x \ 

in the following way. Start with |a;) = |0...0) and \b) = 1 ... 01) . (Assume that |6) contains N total 
binary digits.) 

|0...0)<g>|0...01)i-» 

, . 2 N -1 2^ 
x) x e 2^ 



H mo g \x\ FFT . y y *'\ k) 



xex v i i k=a 

-2»i/(«) 2 W -1 27rifc 

i€X Vl^l fc=0 VZ 

Because this system is separable, we can discard the second expression, and this leaves us with the system 
we want. 

2. We now introduce another system representing the parameter states. Let Y be the set that represents the 
current search space for the parameter vector \y). Our final goal is a system in the form 

2ir.(g i< ( 3 ;)-/(x)) 

E V - — ; 2 \ x ) ® \y) 

We initialize \y) = |0 . . . 0). Again we will make use of \b) — |0 . . . 01) with N binary digits. Using the 
system we created in the previous step, we get 

-2iri/(tt) 

J2 e 2 m k)(8»|0...0><8)l0...01> 
xex v 2 

j^loglXI 0jH -®log|r| . 

-27rif(x) 2™-l 2xife 

Ee 2« y\ e 2 « 

iEa v i i y£i v 1 1 fc=0 



5 : 

e i« | y \ a -i 2 „ fc 

^ . /TvT 1 ' ^ . /HTT ^ ,/oiV 1 ' 



sex vri ye y v 1^1 fc=o v2 w 

Again separability allows us to ignore the |6) qubits and leaves us with the desired system. 

3. Finally, we perform a Hadamard transformation on the \x) qubit system. So we get 

2 J i( gy (x)-/(x)) 

zex a er xex 1^1 V I 1 I 

4. The final step in this procedure is to measure the \z) qubits. Based on the state expansion outlined above, 
the probability that \z) = |0 . . . 0) is 

2„( a ,M-fM) 
lV|V e 2 " 12 = 

1 1 yey xex 1 1 



^This step basically follows the procedure in Shor. 18 



1 1 y£Y 

The exact value of this probability can be estimated to arbitrary accuracy with high probability by repeated 
trials using the above steps. (Make sure to keep those experiments such that \z) = |0 . . . 0) as they will 
likely be useful later. Also note the similarity of the final expression to Assumption 3 of Section 3.1.) We 
can now make our decision as to whether we have chosen an appropriate value of the sensitivity value 
2^. If our estimate of the \z) — |0 . . . 0) measurement probability is not within [ ^y, |] (arbitrarily chosen 
constants"), we proceed through the following list of checks. (Note that we may fall into category (c) after 
we have checked (a) and (b). The final check is not mutually exclusive from the other two.) 

(a) If the \z) = |0...0) measurement probability is below ^ (set above), we double the value of 2^ 
(increasing N by 1) and try again. If this happens, it means that the sensitivity was so low that we 
are only capturing the behavior of the noise in the function / and not so much the differences in the 
function g y as we vary the parameter y. 

(b) If the \z) = |0...0) measurement probability is above | (set above), we halve the value of 2 N 
(decreasing N by 1 if possible**) and try again. In this case, the sensitivity was artificially high. 
The value of 2 N was dwarfing the both the parameter differences of g y and the noise term in /. 

(c) If the \z) = |0 . . . 0) measurement probability exceeds the boundaries in the opposite direction of the 
previous experiment, we terminate the algorithm and declare that the parameter space Y cannot be 
improved upon. In other words, the predictive value of the mathematical model is not significantly 
improved by making the parameter space smaller than it already is. 



Once we get the value of N that we desire from our above experiments by assuring ourselves within a predeter- 
mined tolerance that P(\z) = |0 . . . 0)) € [ , |], we can continue on to the next phase. 

Remark. As soon as the value of ^- (and almost certainly even ^-) approaches |noise|, the algorithm is likely 
to terminate: If "^f e is a uniform random variable taking values between —\ and |, then E(Q(f, g.,2 N )) = -p^-. 
See Lemma 2.2 for justification. 

3.3. Trimming the Parameter Space 

In this section, we describe how the parameter space may be trimmed to half its size with high probability. 
Again, as with the above section, we assume that this section is a function within a larger procedure. Assume 
below that we have chosen a field j/i to work with within the parameter vector y. The ultimate goal of this 
function is to cut the possible values of yi by half correctly with high probability. 



1. Determine the optimal sensitivity 2" using the function described in Section 3.2. 

2. Collect several systems (the exact number to be determined by the error tolerance) that have already been 
measured with \z) = |0 . . . 0). Recalling that the state of the system before the \z) = |0 . . . 0) measurement 
is 



-(-ir z \z)(g>\y) 

V I . / I \ 

z<£X yl£Y x£X 



I v\ l\Y\ 
zex yeY xex I^MV I 1 I 



"These constants were chosen because of their good performance in numerical experiments by the author, but their 
exact values do not really matter. If the range is too great, it will simply take longer for other parts of the process to 
produce good results. We only desire that the probability that we measure \z) = |0. . .0) be bounded above and below 
by constants. See Assumption 3 of Section 3.1. 

**If we reach this point and it is not possible to decrease N any further (i.e. N = 1), we terminate the function and 
report that no sensitivity value will work. 



the final state of the system after the \z) = |0 . . . 0) measurement is 



, 57J 



EE 



\y) 



yevxex VP(\z) = \0...0)) 
where P(\z) = |0 . . . 0)) is defined as in Step 4 of Section 3.2. 

3. If possible, split the parameter space of yt into 4 equal pieces by considering the 2 most significant bits 
of yi. (The case where yi only has a single bit is analogous and will be explicitly outlined below.) For 
j € [0, 3], let Yij be the subset of the parameter space such that the two most significant digits of yi equals 
j. For each of the systems collected above, measure the parameter yi and keep track of which the 
measurement falls into. The probability of measuring yi in the a given is equal to 



E 



2^i( 8y (x)-/(^)) 



|2 



P(|z) = |0...0)) 

We will consider narrowing our search for the optimal parameter value to either Yto U Yn, Yn U Ya 1 or 
Y{2 U 3^3^. Recalling the discussion of Section 2.2, we take into account just how different we need the 
parameter value to be in order to perform this narrowing of the parameter space^ . Once we have made our 
choice as to which part of the space we are interested in examining, we modify the function g y , replacing 
yi with the appropriate linear transformation (See Section 3.4.) and reducing the number of search bits in 
yi by one. We then continue on with the algorithm. 



It remains to discuss what occurs when the search space cannot be split into 4 equal pieces. In this 
case, yt is either or 1. We simply run the analogous experiment described above for the quarter case and 
choose the correct half of the space based on the given threshold. Once we have made the final decision if 
possible, yi is eliminated as a parameter. 

If it turns out that we cannot meet the threshold for trimming the parameter space, we mark yt finished 
as a parameter and continue on to j/i+i. 



3.4. The Overall Algorithm 

The algorithm amounts to repeated application of the procedure outlined in Section 3.3. 

While the parameter space continues to shrink in size: for each parameter in succession, perform the trimming 
described above in Section 3.3. Whenever a parameter is trimmed, the function g y is updated: if we are working 
on the variable yi and choose the half space Y i0 U Yn (resp. Y i2 U^), then the function g y automatically inserts 

ft At this point, the astute reader might be asking himself why we bother to split the parameter space in quarters when 
it looks like we may be able to split it in half and speed up the algorithm by a factor of 2. Let us assume that yi can 
be between and 31 and the optimal value is 15 but both 15 and 16 are acceptable parameter values for our model (see 
Section 2.2). We split the parameter space into 4 parts: [0, 7], [8, 15], [16, 23], [24, 31]. Clearly, [8, 15] should have a large 
probability mass, but so might [16, 23] if 16 is an acceptable parameter value as well. 

M One point should be clarified before proceeding further: If the parameter ratio of one of the three half-spaces does not 
meet the threshold for further narrowing of the parameter space, there are two possibilities for why: (a) The parameter 
spectrum need not be narrowed any further because the data fits the model approximately the same way for each half- 
space. In this case, no narrowing of the search is desirable, (b) The mathematical model/search space have not been 
chosen well. For example, the search space might be so large that good parameter values in a half space are drowned out 
by the large number of bad choices. 



a (rcsp. 1) for the high order bit of y% and the number of input bits for gets reduced by 1. If we choose the 
half space Yn U Yj2 and yi contains b bits, then we let the high order bit of m be 0, reduce the number of input 
bits in yi to b — 1 (similar to the 5^o U U Y^ cases), and then before evaluation of g y perform the b bit 

addition yi ^ yi + 2 b ~ 2 — 1. (The single bit case is analogous.) Once the parameter space ceases to shrink in 
size, output the reduced parameter spaces for each of the parameters. 

The Appendix (Section 5.3) contains a fully worked out nonlinear example of this algorithm. 

3.5. The Query/Time Complexity 

Modulo various constants depending on the error tolerance and threshold values, the algorithm presented above 
requires 0(polylog(\X\, \Y\)) elementary quantum operations (as defined in 3 ) and 0(polylog(\Y\)) evaluations of 
the functions / and g y . 

By way of comparison, we examine the classical running time for finding the best fit for a trial function 
using \X\ experimental data points and parameter space of size \Y\. The standard technique is to minimize 
the function L2(f,g y ). Note that simply writing down the function L2(f,g y ) requires time f2(|X|) (if we do 
no probabilistic sampling of the search space). Minimizing L2(f,g y ) with respect to the parameter vector y 
requires either solving the equations VL2(/, gy) = or minimizing L2(f,g y ) directly using a global optimization 
protocol. (Writing down the equations VL2(/, g y ) = requires time proportional to both \X\ and the number of 
parameters in the parameter vector y. Note that in the case where the number of parameters in the parameter 
vector y is large, say f2(|F| c ) for some small c < 1, it takes time f2(|A| |y| c ) just to write down the equations to 
be solved.) If the functions g y are sufficiently nasty, this could necessitate solving equations numerically using a 
probabilistic method, genetic algorithm, or other previously mentioned sampling technique. If we have to sample 
0(|y| fe ) times for some constant k < 1, then the classical running time for finding the best fit parameter vector y 
could reach 0(|A||y| fe ). This is exponentially worse than the quantum algorithm presented here. If probabilistic 
sampling from the search space is done, the quantum algorithm and classical probabilistic algorithms still perform 
at asymptotically similar speeds. 

In conclusion, the inherently quantum nature of interference allows potentially exponential speedup for suf- 
ficiently well-chosen parameterized functions. Constructive interference takes place when the data matches the 
function, and the quantum coefficients interfere destructively when the parameters provide a poor fit. 

3.6. Limitations and Areas of Future Research 

The algorithm we present here has some obvious limitations. In this section, we present a simple example to 
illustrate its shortcomings. 

Consider the function that is everywhere except at a single unknown point a where it takes the value 1. 
We wish to discover the value of a by fitting this function to a function of the form 



Clearly, we want the end value of the parameter y to equal a. 

At this point, note that this situation violates one of our heuristics for good performance, specifically As- 
sumption 3 in Section 3.1. These functions are not effectively separated by the quantum distance measure Q; on 
the other hand, L 2 is 1 for every value y ^ a and if y = a. We do not expect our quantum algorithm to perform 
well. Indeed, it cannot possibly perform well: this situation is equivalent to the database search problem. There 
is a well-known result that any quantum algorithm that solves the database search problem with high probability 
must make fl(\/N) queries to the database where N is the number of elements in the database. 19 In our case, 
the number of elements is equal to |Y|; because our algorithm makes only 0(polylog{\Y\)) "queries", it cannot 
possibly find the correct value in time with high probability. 

One interesting possible area of research we leave open is to determine the family of functions for which the 
algorithm is able to perform correctly with high probability. Section 4 below will guarantee good behavior from 
the algorithm in the linear and multilinear case. However, it appears that determining where the separation 




between "well-behaved" families of functions (such as the linear and multilinear cases) and "databases" is a 
very difficult problem. In Ambainis, 20 it is claimed that "the unordered search problem provides an abstract 
model for NP-complete problems." Indeed, it appears that one might just as well ask the question of where 
the separation occurs between boolean functions for which there exists a polynomial-time satisfiability algorithm 
and those that are truly intractable. 



4. ANALYSIS OF LINEAR REGRESSION 

This section will present a rigorous analysis of necessary and sufficient conditions for good behavior in the case 
where we use the algorithm to perform linear regression. 

4.1. One-Dimensional Linear Regression 

Definition 4. Using the same notation as in the previous sections, we will assume that Y = {—2 K ~ 1 , . . . , 2 K ~ 1 — 
1} and X = {0, . . . , 2 L — 1}. Then log 2 \X\ = L and log 2 \ Y\ = K . N has the same sensitivity meaning as it did 
above in Section 3. Finally, let N = L + K + r where r is a constant whose optimal value will be worked out in 
the analysis to follow. 

We will assume f(x) = y'x + noise for some y' EY (see Section 3.1, Assumption 1). So g y (x) = yx. We will 
assume that noise is a Gaussian random variable with small variance. (The mean is irrelevant because of the 
translational invariance of Q.) Recall that 
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We look to find a lower bound for this expression. Recalling Assumption 2 of Section 3.1, our first approximation 

27ri(noise) 

is that Var( n °™ e ) <C 1 =^(by translational invariance)e ^ rj 1. That reduces our expression to 
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Letting y* = ^ e [— |, i], recalling that N = L + K + r, and approximating the sum with an integral, we get 
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As L gets large, 2 L gets very large. We make the approximation that 2 L — > 00. So taking this limit and letting 

w = 2~ r -Ku, we get 

2 r r^th-v*) sinur 2 , 



Note that this expression is independent of all variables except r and y* . The following is a plot of P(\z) = 
|0 . . . 0)). The r value is on the x-axis (front) and the y* value is on the y-axis (right). 




Given this expression for P(\z) = |0 . . . 0)), it is now easy to calculate the probabilities of the three relevant 
half spaces: 

- ft!°~ yn A s -^) 2 dw 
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We can calculate the maximum of these three half spaces for any given values of r and y* . Below is a plot of 
this maximum. 




By examining these plots, we notice that if we choose r = — 1, then P(\z) = |0 . . . 0)) > 20% and the half 
space with the greatest probability must have weight at least 70%. So rather than performing the algorithm 
outlined in Section 3.2, in the case of linear regression with a single variable, we simply choose N = L + K — 1. 

Remark. It is somewhat strange that the maximum value that f(x) — g y {x) takes in this case requires L + K 
bits to compute. Our result above shows that we can ignore the highest order bit of the computation. 

4.2. Multivariate Linear Regression 

The analysis of the multivariable case parallels that of the one-dimensional. 

Definition 5. Using the same notation as the previous sections, we assume that Y = {— 2 K ~ 1 , . . . , 2 A_1 — l} xd 
and X = {0, . . . , 2 L - l} xd . Let N = L + K + r as before. 

We assume that f(x) = Y]j—x Uj x j + noise for some vector y' G Y and g y (x) — Ylj—i Vj x j- The random 
variable noise is still Gaussian with small variance. The results are as follows. 

p(i,>=io...o»«n(- r h ^\^?d W ) 



where Vj G {1, . . . , d}, y* = ^i-. The plot for the greatest of the three half spaces is exactly the same as the 
one-dimensional case. Thus, if we choose r = — 1, the probability that we measure P(\z) = |0 . . . 0)) decreases 
exponentially with d; however, there is no decrease in the relative weight of the half spaces. The mathematical 
justification for these results can be found in the Appendix (Section 5.2). 
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5. APPENDIX 



5.1. The Pseudometric Q* 

This section introduces a pseudo-metric Q* = 1 — ^/Q (for small enough M). The function 1 — y/Q is a metric 
"with high probability" in the sense that the properties of a metric are satisfied with high probability given 
appropriate conditions that we outline. We introduce this function in order to relate the new measure Q to 
other, more commonly used metrics like L\ and Li. 

We must be careful about describing the metric space over which we claim Q* as a pseudo-metric. Consider 
the space of all real- valued functions /(x). Wc identify two functions / and g as equivalent if / = g + k for some 
constant k. We claim that Q* is a metric for the resulting space of equivalence classes. We denote by / the 
equivalence class of the function /. Note that by translational invariance (Lemma 2.1), Q* is well-defined over 
this space. 

Lemma 5.1. The function Q* = 1 — ^/Q is a metric over the set of translational equivalence classes of functions 
with high probability for small enough values of M and large enough \X\. 

Proof. We need to check three conditions. Only the triangle inequality is nontrivial. We need to show that 
for functions /, g, and h, (1 - y/Q{f,g,M)) + (1 - \/Q{g, h, M)) > 1 - y/Q(f,h,M). Note that by Lemma 2.2, 
if M gets small, all of the Q values approach pq- with high probability. If \X\ is large enough, the inequality is 
true by inspection. □ 

It is also possible to bound Q* between relevant modifications of two well-known metrics, L\ and L^. 
Lemma 5.2. Let 

\f{x)-g{x)\ 



Ll(f,g,M)=J2 

x£X 

and 
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L 2 {f,g,M) - 2^ , , 2 



\X\M 2 
xex 1 1 

(Note that L\ and L 2 are not well-defined on the space of vertically equivalent functions. L\ is an average 
normalized distance and L 2 is an average normalized squared distance.) Then there exists M > such that 

2nLUf,9,M) + (1 - |) < <?(f,g,M) < 2tt 2 L* 2 (f , g , M) 



Proof. Let Mi = max l£X 4|/(ar) - g{x)\. Then we have that Vx e X, \ f{x) M f x) \ < \. Given this value for 
Mi, we now perform the random walk given by Q described in Section 2.1. An appropriate vertical modification 
to the function /, say replacing / with / + fci for some constant fci, will suffice to place the endpoint of the 
random walk on the x-axis. If it is still true that Vx e X, | ^ a j^ g ^ i < \, then wc let M = M\. Otherwise, given 
the new function / + k\ , we repeat the process which leads to M 2 and / + k\ + k 2 for some constant k 2 . We 
continue evaluating Mi until we have found our value of M. Note that these steps must eventually terminate 
because the size of max Ig x \ ^ X ^ M 9 ^ I is decreasing substantially with every iteration. 

Note that after the process of determining M has completed, / has been modified to / + k\ + k<i + . . . + k n 
(if there are n iterations) and the random walk is guaranteed to terminate on the x-axis. Thus, using standard 
approximations for sin(x) and cos(x), we get 
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Algebra yields the resulting bounds. □ 

Remark. Note that the bounds given in Lemma 5.2 are essentially tight. On the right, if / = g, then 
Ll(f,g,M) = => Q(f,g,M) = 1. On the left, if / is as far from g as possible, then L\{f,g,M) = j => 
Q(f,g,M) = 0. 

5.2. The Multivariate Mathematics 

We will be using the same notation as that given in Section 4.2. Recall that 
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and if we assume that the effect of the noise term is negligible, we get 
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Now, we perform the same one-dimensional analysis for each j term to get the final result. 

« 11 — / ( ) dw) 

To see that the half space calculations turn out the same, note that we only change the top two bits of a single 
variable at a time. Only the j term for that variable changes in the half space calculation. All other variables 
cancel out. 

5.3. A Nonlinear Example 

This section performs the algorithm of Section 3 on a specific example. We will use f(x\,X2) = x\ + 16a; 2 + 
constant + noise where noise is a random variable that take values uniformly between -30 and 30 and constant 
is a vertical constant that we care nothing about. Data has been collected from the domain [0, 31] x [0, 31]. The 
parameter function we use is g(x\, X2, j/i, JJ2) = V\x\ + U2X2- We will use 3 bits for y\ and 5 bits for j/2- We use a 
threshold of 60% when performing Step 3 of Section 3.3. Our acceptable range for measuring \z) = |0 . . . 0) will 
be [jfi, |] = [.1, .6]. The following is a 3D graph of the function f(x\, x 2 ). 



1. Begin by working on parameter yx G [0, 7]. 

(a) We initialize 2 N = 2 11 => P(\z) = |0 . . . 0)) = 0.265781 G [.1, .6]. This sensitivity is accepted. 

(b) The probabilities of measuring y\ in the relevant ranges are as follows: 

• P{yi G [0,1]) = 0.643622 

• P(yi G [2,3]) = 0.230168 

• P{yi G [4,5]) = 0.0778675 

• P{yi G [6,7]) = 0.0483422 

We accept the range y\ G [0, 3] with total probability 0.87379. (tolerance 60%) 

(c) Recalculating with 2 N = 2 11 P(\z) = |0 . . .0)) = 0.448745 G [.1, .6]. This sensitivity is accepted. 

(d) The probabilities of measuring y\ in the relevant ranges are as follows: 

• P(yi G [0,0]) = 0.217404 

• P(yi G [1,1]) = 0.519182 

• P(yi G [2,2]) = 0.217218 

• P(yi G [3,3]) = 0.046196 

We accept the range y\ G [0, 1] with total probability 0.736586. (tolerance 60%) 

(e) Recalculating with 2 N = 2 11 => P(\z) = |0 . . .0)) = 0.448745 G [.1, .6]. This sensitivity is accepted. 

(f) We have narrowed y\ to two possible values. 

• P[yx = 0) = 0.295151 

• P{yi = 1) = 0.704849 

We accept y\ = 1. (tolerance 60%) 

2. Working on parameter y% G [0,31]. 

(a) Recalculating with 2 N = 2 11 P(\z) = |0 0>) = 0.931922 g [.1,.6]. This sensitivity is not 

accepted. 

(b) Recalculating with 2 N = 2 10 =>■ P(|z) = |0...0)) = 0.764582 ^ [.1,.6]. This sensitivity is not 
accepted. 

(c) Recalculating with 2 N = 2 9 P(\z) = |Q...0)) = 0.431486 G [.1, .6]. This sensitivity is accepted. 

(d) The probabilities of measuring y% in the relevant ranges are as follows: 

• P{y 2 G [0,7]) = 0.0546121 

• P(y 2 G [8,15]) = 0.40125 

• P{y 2 G [16,23]) = 0.453112 

• P{y 2 G [24,31]) = 0.0910255 

We accept the range y 2 G [8, 23] with total probability 0.854362. (tolerance 60%) 

(e) Recalculating with 2 N = 2 9 P(\z) = |0 . . .0)) = 0.737291 £ [.1, .6]. This sensitivity is not accepted. 



(f) Recalculating with 2 N = 2 s => P(\z) = |0 . . .0)) = 0.375476 G [.1, .6]. This sensitivity is accepted. 

(g) The probabilities of measuring y 2 in the relevant ranges are as follows: 

• P(V2 G [8,11]) = 0.0398828 

• P{y 2 G [12,15]) = 0.371314 

• P(y 2 e [16,19]) = 0.475369 

• P{y 2 G [19,23]) = 0.113435 

We accept the range y 2 G [12, 19] with total probability 0.846683. (tolerance 60%) 

(h) Recalculating with 2 N = 2 8 => P(|z) = |0 . . . 0» = 0.635818 £ [.1, .6]. This sensitivity is not accepted. 

(i) Recalculating with 2 N = 2 7 = |0 . . . 0» = 0.206041 G [.1, .6]. This sensitivity is accepted, 
(j) The probabilities of measuring y 2 in the relevant ranges are as follows: 

• P(y 2 e [12,13]) = 0.0170483 

• f(j/2 G [14,15]) = 0.300177 

• P(V2 G [16,17]) = 0.511306 

• P(y 2 G [18,19]) = 0.171468 

We accept the range j/2 G [14, 17] with total probability 0.811483. (tolerance 60%) 

(k) Recalculating with 2 N = 2 7 P(|z) = |0. ..0)) = 0.334398 e [.1, .6]. This sensitivity is accepted. 
(1) The probabilities of measuring y 2 in the relevant ranges are as follows: 

• P{V2 G [14,14]) = 0.115462 

• P(y 2 G [15,15]) = 0.254449 

• P(y 2 G [16,16]) = 0.336552 

• P(y2 G [17,17]) = 0.293536 

We accept the range y 2 e [16, 17] with total probability 0.630089. (tolerance 60%) 

(m) Recalculating with 2 N = 2 7 P(|z) = |0. ..0)) = 0.334398 G [.1, .6]. This sensitivity is accepted, 
(n) We have narrowed y 2 to two possible values. 

• P(V2 = 16) = 0.534135 

• P{y 2 = 17) = 0.465865 

We cannot accept either, (tolerance 60%) 



The final answer is therefore y\ = 1 and y 2 G [16, 17] with g{x\, x 2 , yi, y 2 ) — yix\ + y 2 x 2 . 



